Using Clustering to Identify Outlier Chunks of Text - Notebook for PAN at CLEF 2011
Abstract
Intrinsic plagiarism detection is a sub-task of authorship identification in which outlier chunks must be detected solely on the basis of stylistic differences from the main body of the text. We present a first attempt at utilizing words that appear infrequently in a text as stylistic markers for distinguishing outlier chunks in the text. In the first phase of our method we cluster chunks of text represented by usage of infrequent words. In the second phase, we use a training corpus to identify cluster properties of outlier chunks.
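The first phase described above could be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the fixed-size word chunks, the binary infrequent-word vectors, the small k-means routine with farthest-point seeding, and every function name and threshold (`chunk_words`, `infrequent_vocab`, `vectorize`, `kmeans`, `cluster_chunks`, `max_count`, `size`, `k`) are hypothetical choices made for illustration. The second phase, which learns cluster properties of outlier chunks from a training corpus, is not covered.

```python
from collections import Counter


def chunk_words(words, size):
    """Split a token list into consecutive chunks of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def infrequent_vocab(words, max_count=1):
    """Words occurring at most `max_count` times in the whole text (assumed cutoff)."""
    counts = Counter(words)
    return sorted(w for w, c in counts.items() if c <= max_count)


def vectorize(chunk, vocab):
    """Binary vector: 1.0 if the chunk uses the infrequent word, else 0.0."""
    present = set(chunk)
    return [1.0 if w in present else 0.0 for w in vocab]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k=2, iters=20):
    """Tiny deterministic k-means with farthest-point seeding."""
    if not vectors:
        return []
    k = min(k, len(vectors))
    # Seed with the first vector, then repeatedly add the farthest vector.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each chunk vector to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_chunks(words, size=4, max_count=1, k=2):
    """Cluster chunks of text by their usage of infrequent words."""
    vocab = infrequent_vocab(words, max_count)
    vectors = [vectorize(c, vocab) for c in chunk_words(words, size)]
    return kmeans(vectors, k)
```

For example, `cluster_chunks("a b a b a b a b x y a b".split())` returns `[0, 0, 1]`: the third chunk is the only one that uses the infrequent words `x` and `y`, so it falls into its own cluster. Whether such a cluster actually marks an outlier chunk is what the training-based second phase would have to decide.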
Similar resources
An Empirical Research: "Wikipedia Vandalism Detection using VandalSense 2.0" - Notebook for PAN at CLEF 2011
Wikipedia, despite having a very small budget, has been among the top ten most visited websites for over half a decade. This visibility has also attracted ill-intentioned people who modify Wikipedia in a destructive manner. VandalSense is an experimental tool programmed by F. Gediz Aksit to automatically identify vandalism on Wikipedia through the use of machine learning and text mining...
Using Simple Content Features for the Author Profiling Task - Notebook for PAN at CLEF 2013
This paper describes the methods we have employed to solve the author profiling task at PAN-2013. Our goal was to use simple features to identify the age group and the gender of the author of a given text. We introduce the features, detail how the classifiers were trained, and how the experiments were run.
Improved Implementation for Finding Text Similarities in Large Sets of Data - Notebook for PAN at CLEF 2011
In this article we describe a new method for the detection of plagiarism. The method removes numerous limitations of our older method, which has been used as part of a complex information system for the detection of plagiarism. The method has been tested using multiple corpora, mainly in the Slovak language. With the PAN-09 and PAN-10 corpora it was of great advantage that we could compare...
External & Intrinsic Plagiarism Detection: VSM & Discourse Markers based Approach - Notebook for PAN at CLEF 2011
This paper explains the performance of a plagiarism detection system that can detect both external and intrinsic plagiarism in text. It reports results on the PAN-PC-2011 test corpus. We investigated Vector Space Model-based techniques for detecting external plagiarism cases and discourse-marker-based features for detecting intrinsic plagiarism cases.
Detecting Wikipedia Vandalism using Machine Learning - Notebook for PAN at CLEF 2011
Wikipedia vandalism identification is a very complex issue, which is currently handled mostly manually by volunteers. This paper presents the main components of a system built by our group to automatically identify vandalized Wikipedia articles. The main component of our system is a machine learning component that uses features grouped in three classes: Metadata, Text and Language....